Maximum likelihood fits to data can be done using binned data (histograms) or unbinned data. With binned
data, one obtains not only the fitted parameters but also a measure of the goodness of fit. With unbinned data,
the fitted parameters are currently obtained, but no measure of goodness of fit is available. This remains, to
date, an unsolved problem in statistics. Using Bayes' theorem and likelihood ratios, we provide a method by
which both the fitted quantities and a measure of the goodness of fit are obtained for unbinned likelihood fits,
as well as errors in the fitted quantities. The quantity conventionally interpreted as a Bayesian prior is seen in
this scheme to be a number, not a distribution, that is determined from data.



1. Introduction

As of the Durham conference, the problem of obtaining a goodness of fit in unbinned likelihood fits
was an unsolved one. In what follows, we will denote by the vector $s$ the theoretical parameters ($s$ for
"signal") and by the vector $c$ the experimentally measured quantities or "configurations". For simplicity,
we will illustrate the method where both $s$ and $c$ are one dimensional, though either or both can be
multi-dimensional in practice. We thus define the theoretical model by the conditional probability density
$P(c|s)$. Then an unbinned maximum likelihood fit to data is obtained by maximizing the likelihood [2],

$$\mathcal{L} = \prod_{i=1}^{n} P(c_i|s) \tag{1}$$

where the likelihood is evaluated at the $n$ observed data points $c_i$, $i = 1, n$. Such a fit will determine
the maximum likelihood value $s^*$ of the theoretical parameters, but will not tell us how good the fit is.
The value of the likelihood $\mathcal{L}$ at the maximum likelihood point does not furnish a goodness of fit, since
the likelihood is not invariant under change of variable. This can be seen by observing that one can
transform the variable set $c$ to a variable set $c'$ such that $P(c'|s^*)$ is uniformly distributed between 0 and 1.
Such a transformation is known as a hypercube transformation in multi-dimensions. Other datasets
will yield different values of likelihood in the variable space $c$ when the likelihood is computed with the
original function $P(c|s^*)$. However, in the hypercube space, the value of the likelihood is unity
regardless of the dataset $c'_i$, $i = 1, n$. Thus the likelihood $\mathcal{L}$ cannot furnish a goodness of fit by
itself, since neither the likelihood nor ratios of likelihoods computed using the same distribution $P(c|s^*)$
are invariant under variable transformations. The fundamental reason for this non-invariance is that only a
single distribution, namely $P(c|s^*)$, is being used to compute the goodness of fit.

2. Likelihood ratios

In binned likelihood cases, where one is comparing a theoretical distribution $P(c|s)$ with a binned
histogram, there are two distributions involved, the theoretical distribution and the data distribution. The pdf
of the data is approximated by the bin contents of the histogram normalized to unity. If the data consists of
$n$ events, the pdf of the data $P^{data}(c)$ is defined in the frequentist sense as the normalized density
distribution in $c$ space of $n$ events as $n \to \infty$. In the binned case, we can bin in finer and finer bins as
$n \to \infty$ and obtain a smooth function, which we define as the pdf of the data $P^{data}(c)$. In practice, one
is always limited by statistics and the binned function will be an approximation to the true pdf. We can now
define a likelihood ratio $\mathcal{L}_R$ such that



$$\mathcal{L}_R = \frac{\prod_{i=1}^{n} P(c_i|s)}{\prod_{i=1}^{n} P^{data}(c_i)} \equiv \frac{P(c_n|s)}{P^{data}(c_n)} \tag{2}$$



where we have used the notation $c_n$ to denote the event set $c_i$, $i = 1, n$. Let us now note that
$\mathcal{L}_R$ is invariant under the variable transformation $c \to c'$, since



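$$P(c'|s) = \left|\frac{\partial c}{\partial c'}\right| P(c|s) \tag{3}$$

$$P^{data}(c') = \left|\frac{\partial c}{\partial c'}\right| P^{data}(c) \tag{4}$$

$$\mathcal{L}_R(c') \equiv \frac{P(c'|s)}{P^{data}(c')} = \frac{P(c|s)}{P^{data}(c)} \equiv \mathcal{L}_R(c) \tag{5}$$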



and the Jacobian of the transformation $\left|\frac{\partial c}{\partial c'}\right|$ cancels in the numerator and
denominator of the ratio. This is an extremely important property of the likelihood ratio $\mathcal{L}_R$ that
qualifies it to be a goodness of fit variable. Since the denominator $P^{data}(c_n)$ is independent
of the theoretical parameters $s$, both the likelihood ratio and the likelihood maximize at the same point $s^*$.
One can also show that the maximum value of the likelihood ratio occurs when the theoretical likelihood
$P(c_i|s)$ and the data likelihood $P^{data}(c_i)$ are equal for all $c_i$.
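
The invariance of $\mathcal{L}_R$, and the lack of it for $\mathcal{L}$, can be checked numerically. The following is a minimal sketch under stated assumptions, not part of the original analysis: the theory is an exponential $P(c|s^*)$ with an assumed $s^* = 1$, the unknown data pdf is replaced by a stand-in exponential with a hypothetical slope `S_DATA`, and the change of variable is $c' = 1 - e^{-c/s^*}$, which makes $P(c'|s^*)$ flat on (0, 1). All names are illustrative.

```python
import math
import random

def exp_pdf(c, s):
    """Exponential density exp(-c/s)/s, used for the theory and for the stand-in data pdf."""
    return math.exp(-c / s) / s

S_FIT = 1.0    # assumed fitted value s*
S_DATA = 1.2   # hypothetical slope of the stand-in data pdf

def jacobian(c):
    """|dc/dc'| for the map c' = 1 - exp(-c/S_FIT), evaluated at the point c."""
    return S_FIT * math.exp(c / S_FIT)

def log_likelihoods(events):
    """Return log P(c_n|s*) and log Pdata(c_n), first in c space and then in c' space."""
    th_c = sum(math.log(exp_pdf(c, S_FIT)) for c in events)
    da_c = sum(math.log(exp_pdf(c, S_DATA)) for c in events)
    # Equations (3) and (4): both densities pick up the same Jacobian factor.
    th_cp = sum(math.log(exp_pdf(c, S_FIT) * jacobian(c)) for c in events)
    da_cp = sum(math.log(exp_pdf(c, S_DATA) * jacobian(c)) for c in events)
    return th_c, da_c, th_cp, da_cp

random.seed(1)
events = [random.expovariate(1.0 / S_DATA) for _ in range(200)]
th_c, da_c, th_cp, da_cp = log_likelihoods(events)

print("log L   in c  space:", th_c)            # depends on the dataset
print("log L   in c' space:", th_cp)           # close to 0: the likelihood is unity
print("log L_R in c  space:", th_c - da_c)
print("log L_R in c' space:", th_cp - da_cp)   # same as in c space
```

In $c'$ space the log likelihood vanishes (up to rounding) for any dataset, whereas the log likelihood ratio takes the same value in both spaces.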
3. Binned Goodness of Fit

In the case where the pdf $P^{data}(c)$ is estimated by binned histograms and the statistics are Gaussian, it
is readily shown that the commonly used goodness of fit variable is $\chi^2 = -2 \log \mathcal{L}_R$. It is worth
emphasizing that the likelihood ratio as defined above, and not just the negative log of the theoretical
likelihood $P(c_n|s)$, is needed to derive this result. The popular conception that $\chi^2$ is
$-2 \log P(c_n|s)$ is simply incorrect! It can also be shown that the likelihood ratio defined above can describe
the binned cases where the statistics are Poissonian [4]. In order to solve our problem of goodness of fit in
unbinned likelihood cases, one needs to arrive at a method of estimating the data pdf $P^{data}(c)$
without the use of bins.
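
The relation $\chi^2 = -2 \log \mathcal{L}_R$ can be illustrated with a few lines of arithmetic. The sketch below rests on an assumption that is not spelled out in the text: for Gaussian statistics the data likelihood of a bin is taken to be the same Gaussian centred on the observed content, so that it equals the theory likelihood when prediction and observation agree. The bin contents, predictions and errors are hypothetical numbers.

```python
import math

def gauss(x, mean, sigma):
    """Gaussian density in x with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

# Hypothetical bin contents, theory predictions and Gaussian errors
observed = [12.0, 25.0, 31.0, 18.0, 9.0]
predicted = [10.0, 24.0, 33.0, 20.0, 8.0]
sigmas = [math.sqrt(p) for p in predicted]

chi2 = sum(((o - p) / e) ** 2 for o, p, e in zip(observed, predicted, sigmas))

# Theory likelihood: Gaussian centred on the prediction, evaluated at the observation.
log_theory = sum(math.log(gauss(o, p, e)) for o, p, e in zip(observed, predicted, sigmas))
# Data likelihood (assumption): the same Gaussian centred on the observed bin content.
log_data = sum(math.log(gauss(o, o, e)) for o, e in zip(observed, sigmas))

minus_2_log_ratio = -2.0 * (log_theory - log_data)
minus_2_log_theory = -2.0 * log_theory

print("chi2               :", chi2)
print("-2 log L_R         :", minus_2_log_ratio)    # agrees with chi2
print("-2 log P(c_n|s)    :", minus_2_log_theory)   # differs by the normalization terms
```

The normalization factors $1/(\sqrt{2\pi}\sigma_i)$ cancel only in the ratio, which is why the theory likelihood alone does not reproduce $\chi^2$.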



4. Unbinned Goodness of Fit 

One of the better known methods of estimating the probability density of a distribution in an unbinned
case is by the use of Probability Density Estimators (PDE's), also known as Kernel Density Estimators [5]
(KDE's). The pdf $P^{data}(c)$ is approximated by

$$P^{data}(c) \approx PDE(c) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{G}(c - c_i) \tag{6}$$

where a kernel function $\mathcal{G}(c - c_i)$, centered around each data point $c_i$, is so defined that it
normalizes to unity and for large $n$ approaches a Dirac delta function. The choice of the kernel function can
vary depending on the problem. A popular kernel is the Gaussian, defined in the multi-dimensional case as

$$\mathcal{G}(c) = \frac{1}{(\sqrt{2\pi}\,h)^d \sqrt{\det(E)}} \exp\left(-\frac{H^{\alpha\beta} c_\alpha c_\beta}{2h^2}\right) \tag{7}$$

where $E$ is the error matrix of the data defined as

$$E_{\alpha\beta} = \langle c_\alpha c_\beta \rangle - \langle c_\alpha \rangle \langle c_\beta \rangle \tag{8}$$

and $\langle\,\rangle$ implies an average over the $n$ events, and $d$ is the number of dimensions. The Hessian
matrix $H$ is defined as the inverse of $E$, and repeated indices imply summation. The parameter $h$ is a
"smoothing parameter", which has [6] a suggested optimal value $h \propto n^{-1/(d+4)}$ that satisfies the
asymptotic condition

$$\mathcal{G}_\infty(c - c_i) \equiv \lim_{n \to \infty} \mathcal{G}(c - c_i) = \delta(c - c_i) \tag{9}$$

The parameter $h$ will depend on the local number density and will have to be adjusted as a function of the
local density to obtain a good representation of the data by the PDE. Our proposal for the goodness of fit in
unbinned likelihood fits is thus the likelihood ratio

$$\mathcal{L}_R = \frac{P(c_n|s)}{P^{data}(c_n)} \approx \frac{P(c_n|s)}{P^{PDE}(c_n)} \tag{10}$$

evaluated at the maximum likelihood point $s^*$.
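
As an illustration of equations (6) to (10) in one dimension, the following sketch builds a Gaussian-kernel PDE and evaluates the negative log likelihood ratio for a toy dataset. It is a simplified example under stated assumptions: the data are drawn from a unit Gaussian (chosen so that no boundary is involved), the "fitted" theory is a Gaussian with the sample mean and width, the smoothing parameter is constant, and the proportionality constant in $h \propto n^{-1/(d+4)}$ is set to one. None of these choices come from the text.

```python
import math
import random

def gaussian_kernel(x, sigma, h):
    """Equation (7) in one dimension: a unit-area Gaussian of width sigma * h."""
    w = sigma * h
    return math.exp(-0.5 * (x / w) ** 2) / (math.sqrt(2.0 * math.pi) * w)

def pde(c, events, sigma, h):
    """Equation (6): average of kernels centred on the data points."""
    return sum(gaussian_kernel(c - ci, sigma, h) for ci in events) / len(events)

def negative_log_likelihood_ratio(events, theory_pdf, sigma, h):
    """-log of equation (10): sum over events of log PDE(c_i) - log P(c_i|s)."""
    return sum(math.log(pde(c, events, sigma, h)) - math.log(theory_pdf(c)) for c in events)

random.seed(2)
n = 300
events = [random.gauss(0.0, 1.0) for _ in range(n)]   # toy data

mean = sum(events) / n
sigma = math.sqrt(sum((c - mean) ** 2 for c in events) / n)   # 1-d version of equation (8)
h = n ** (-1.0 / 5.0)   # n^(-1/(d+4)) with d = 1; unit prefactor is an assumption

def theory(c):
    """Hypothetical fitted model: Gaussian with the sample mean and width."""
    return math.exp(-0.5 * ((c - mean) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

nllr = negative_log_likelihood_ratio(events, theory, sigma, h)
print("h =", round(h, 3), " NLLR =", round(nllr, 3))
```

The double loop costs of order $n^2$ kernel evaluations, which is acceptable for a few hundred events; a constant $h$ is used here although the text notes that $h$ generally has to follow the local density.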



5. An illustrative example 

We consider a simple one-dimensional case where the data is an exponential distribution, say decay
times of a radioactive isotope. The theoretical prediction is given by

$$P(c|s) = \frac{1}{s} \exp\left(-\frac{c}{s}\right) \tag{11}$$



We have chosen an exponential with $s = 1.0$ for this example. The Gaussian kernel for the PDE would
be given by

$$\mathcal{G}(c) = \frac{1}{\sqrt{2\pi}\,\sigma h} \exp\left(-\frac{c^2}{2\sigma^2 h^2}\right) \tag{12}$$



where the standard deviation $\sigma$ of the exponential is numerically equal to $s$. To begin with, we chose a
constant value for the smoothing parameter, which for the 1000 events generated is calculated to be 0.125.
Figure 1 shows the generated events, the theoretical curve $P(c|s)$ and the PDE curve $P(c)$ normalized to
the number of events. The PDE fails to reproduce the data near the origin due to the boundary effect, whereby
the Gaussian probabilities for events close to the origin spill over to negative values of $c$. This lost
probability would be compensated by events on the exponential distribution with negative $c$ if they existed.
In our case, this presents a drawback for the PDE method, which we will remedy later in the paper using PDE
definitions on the hypercube and periodic boundary conditions. For the time being, we will confine our
example to values of $c > 1.0$ to avoid the boundary effect.
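
The size of the boundary effect can be estimated without generating events. The sketch below computes the expected PDE, that is, the exponential of equation (11) convolved with the Gaussian kernel of equation (12), by direct numerical integration. It assumes $s = \sigma = 1$ and $h = 0.125$ as in the example; the function name, the integration range and the step count are illustrative choices.

```python
import math

def expected_pde(c, s=1.0, w=0.125, steps=20000, upper=40.0):
    """Expected PDE at c: exponential density convolved with a Gaussian kernel of width w.

    Midpoint-rule integral over x in (0, upper) of (1/s) exp(-x/s) times a
    Gaussian in (c - x) with standard deviation w = sigma * h.
    """
    dx = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * dx
        kernel = math.exp(-0.5 * ((c - x) / w) ** 2) / (math.sqrt(2.0 * math.pi) * w)
        total += math.exp(-x / s) / s * kernel * dx
    return total

for c in (0.0, 0.25, 0.5, 1.0, 2.0):
    theory = math.exp(-c)
    print(f"c = {c:4.2f}  theory = {theory:.4f}  expected PDE = {expected_pde(c):.4f}")
```

At the origin the expected PDE is a little under half of the theoretical value, since about half of each kernel near $c = 0$ lies at negative $c$; a few kernel widths away from the boundary the two agree to better than one percent.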

In order to test the goodness of fit capabilities of the likelihood ratio $\mathcal{L}_R$, we superimpose a Gaussian
on the exponential and try to fit the data by a simple exponential. Figure 2 shows the "data" with
1000 events generated as an exponential in the fiducial range $1.0 < c < 5.0$. Superimposed on it is a
Gaussian of 500 events. More events in the exponential are generated in the interval $0.0 < c < 1.0$ to avoid
the boundary effect at the fiducial boundary at $c = 1.0$. Since the number density varies significantly, we have
had to introduce a method of iteratively determining the smoothing factor as a function of $c$, as described
in [3]. With this modification in the PDE, one gets a good description of the behavior of the data by the
PDE, as shown in Figure 2. We now vary the number of events in the Gaussian and obtain the value of
the negative log likelihood ratio $\mathcal{NLLR}$ as a function of the strength of the Gaussian. Table I summarizes
the results. The number of standard deviations the unbinned likelihood fit is from what is expected is
determined empirically by plotting the value of $\mathcal{NLLR}$ for a large number of fits where no Gaussian is
superimposed (i.e. the null hypothesis), determining the mean and RMS of this distribution, and using these
to estimate the number of $\sigma$'s the observed $\mathcal{NLLR}$ is from the null case. Table I also gives the results of
a binned fit on the same "data". It can be seen that the unbinned fit gives a $3\sigma$ discrimination when the
number of Gaussian events is 85, whereas the binned fit gives a $\chi^2/ndf$ of 42/39 for the same case. We
intend to make these tests more sophisticated in future work.

Figure 3 shows the variation of $-\log P(c_n|s)$ and $-\log P^{PDE}(c_n)$ for an ensemble of 500 experiments, each
with the number of events $n = 1000$ in the exponential and no events in the Gaussian (null hypothesis).
It can be seen that $-\log P(c_n|s)$ and $-\log P^{PDE}(c_n)$ are correlated with each other, and the difference
between the two ($\mathcal{NLLR}$) is a much narrower distribution than either and provides the goodness of fit
discrimination.

MOCT003

PHYSTAT2003, SLAC, Stanford, California, September 8-11, 2003

Table I Goodness of fit results from unbinned likelihood and binned likelihood fits for various data samples. The
negative values for the number of standard deviations in some of the examples are due to statistical fluctuation.

Figure 1: The histogram (with errors) of generated events. Superimposed are the theoretical curve
$P(c|s)$ and the PDE estimator (solid histogram with no errors).

Figure 2: The histogram (with errors) of 1000 events in the fiducial interval $1.0 < c < 5.0$
generated as an exponential with decay constant $s = 1.0$, with a superimposed Gaussian of 500 events centered at
$c = 2.0$ and width = 0.2. The PDE estimator is the solid histogram with no errors.

[Plot titles: "Generated events and PDE comparison", "Generated events and PDE"; horizontal axis: Time (arbitrary units)]

Figure 3: (a) shows the distribution of the negative log-likelihood $-\log_e(P(c_n|s))$ for an ensemble of
experiments where data and experiment are expected to fit. (b) shows the negative log PDE likelihood
$-\log_e(P(c_n))$ for the same data. (c) shows the correlation between the two, and (d) shows the negative
log-likelihood ratio $\mathcal{NLLR}$ that is obtained by subtracting (b) from (a) on an event by event basis.



5.1. Improving the PDE 

The PDE technique we have used so far suffers from two drawbacks: firstly, the smoothing parameter has
to be iteratively adjusted significantly over the full range of the variable $c$, since the distribution $P(c|s)$
changes significantly over that range; and secondly, there are boundary effects at $c = 0$, as shown in
Figure 1. Both these flaws are remedied if we define the PDE in hypercube space. After we find the
maximum likelihood point $s^*$, for which the PDE is not needed, we transform the variable $c \to c'$, such that
the distribution $P(c'|s^*)$ is flat and $0 < c' < 1$. The hypercube transformation can be made even if $c$ is
multi-dimensional by initially going to a set of variables that are uncorrelated and then making the
hypercube transformation. The transformation can be such that any interval in $c$ space maps on to the
interval (0, 1) in hypercube space. We solve the boundary problem by imposing periodicity in the hypercube.
In the one-dimensional case, we imagine three "hypercubes", each identical to the other, on the real axis
in the intervals (-1, 0), (0, 1) and (1, 2). The hypercube of interest is the one in the interval (0, 1). When
the probability from an event kernel leaks outside the boundary (0, 1), we continue the kernel to the next
hypercube. Since the hypercubes are identical, this implies the kernel re-appearing in the middle hypercube
but from the opposite boundary. Put mathematically, the kernel is defined such that

$$\mathcal{G}(c' - c'_i) = \mathcal{G}(c' - c'_i - 1); \quad c' > 1 \tag{13}$$

$$\mathcal{G}(c' - c'_i) = \mathcal{G}(c' - c'_i + 1); \quad c' < 0 \tag{14}$$

Figure 4: The distribution of the negative log likelihood ratio $\mathcal{NLLR}$ for the null hypothesis for an
ensemble of 500 experiments each with 1000 events, as a function of the smoothing factor $h$ = 0.1, 0.2 and 0.3.



Although a Gaussian kernel will work on the hypercube, the natural kernel to use, considering the shape
of the hypercube, would be the function $\mathcal{G}(c')$

$$\mathcal{G}(c') = \frac{1}{h}; \quad |c'| < \frac{h}{2} \tag{15}$$

$$\mathcal{G}(c') = 0; \quad |c'| > \frac{h}{2} \tag{16}$$



This kernel would be subject to the periodic boundary conditions given above, which further ensure that
every event in hypercube space is treated exactly as every other event irrespective of its co-ordinates. The
parameter $h$ is a smoothing parameter which needs to be chosen with some care. However, since the theory
distribution is flat in hypercube space, the smoothing parameter may not need to be iteratively determined
over hypercube space, to the extent that the data distribution is similar to the theory distribution. Even if
iteration is used, the variation in $h$ in hypercube space is likely to be much smaller.

Figure 4 shows the distribution of the $\mathcal{NLLR}$ for the null hypothesis for an ensemble of 500 experiments
each with 1000 events as a function of the smoothing factor $h$. It can be seen that the distribution narrows
considerably as the smoothing factor increases. We choose an operating value of 0.2 for $h$ and study the
dependence of the $\mathcal{NLLR}$ on the number of events, ranging from 100 to 1000 events, as shown in
Figure 5. The dependence on the number of events is seen to be weak, indicating good behavior. The PDE
thus arrived at, computed with $h = 0.2$, can be transformed from the hypercube space to $c$ space and will
reproduce data smoothly and with no edge effects. We note that it is also easier to arrive at an analytic
theory of $\mathcal{NLLR}$ with the choice of this simple kernel.

Figure 5: The distribution of the negative log likelihood ratio $\mathcal{NLLR}$ for the null hypothesis for an
ensemble of 500 experiments each with the smoothing factor $h = 0.2$, as a function of the number of events.
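
The hypercube construction of this section can be sketched in a few functions. The example below is illustrative and makes several simplifying assumptions that are not in the text: the model is the exponential with $s^*$ held fixed at 1 rather than refitted for each sample, the hypercube map is $c' = 1 - e^{-c/s^*}$, the ensembles are smaller than those of Figures 4 and 5, and the contaminated sample (300 exponential events plus a 100-event Gaussian at $c = 2.0$ of width 0.2) is a made-up test case. Neighbour counting uses sorted lists for speed; `pde_direct` is the same estimator written with the kernel explicitly.

```python
import bisect
import math
import random

def to_hypercube(c, s_fit):
    """Map c in (0, inf) to c' in (0, 1) so that an exponential of slope s_fit becomes flat."""
    return 1.0 - math.exp(-c / s_fit)

def boxcar_kernel(d, h):
    """Equations (15)-(16) with periodic wrapping: d is folded onto [0, 1/2] first."""
    d = abs(d) % 1.0
    d = min(d, 1.0 - d)
    return 1.0 / h if d < 0.5 * h else 0.0

def pde_direct(x, points, h):
    """PDE at x in hypercube space, summing the periodic boxcar kernel over all events."""
    return sum(boxcar_kernel(x - p, h) for p in points) / len(points)

def nllr_hypercube(points, h):
    """NLLR = sum_i log PDE(c'_i); the theory pdf is 1 everywhere in hypercube space."""
    n = len(points)
    srt = sorted(points)
    ext = [x - 1.0 for x in srt] + srt + [x + 1.0 for x in srt]   # three identical hypercubes
    total = 0.0
    for x in srt:
        count = bisect.bisect_right(ext, x + 0.5 * h) - bisect.bisect_left(ext, x - 0.5 * h)
        total += math.log(count / (n * h))   # count includes the event itself
    return total

def null_ensemble(n_events, h, n_experiments, rng):
    """NLLR values for samples drawn from the theory itself (null hypothesis)."""
    values = []
    for _ in range(n_experiments):
        pts = [to_hypercube(rng.expovariate(1.0), 1.0) for _ in range(n_events)]
        values.append(nllr_hypercube(pts, h))
    return values

def mean_rms(values):
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

rng = random.Random(3)
n_events = 400
for h in (0.1, 0.2, 0.3):
    m, r = mean_rms(null_ensemble(n_events, h, 100, rng))
    print(f"h = {h}: null NLLR mean = {m:.2f}, RMS = {r:.2f}")

# Hypothetical contaminated sample compared with the h = 0.2 null distribution
null_mean, null_rms = mean_rms(null_ensemble(n_events, 0.2, 100, rng))
sample = [rng.expovariate(1.0) for _ in range(300)] + [rng.gauss(2.0, 0.2) for _ in range(100)]
observed = nllr_hypercube([to_hypercube(c, 1.0) for c in sample], 0.2)
n_sigma = (observed - null_mean) / null_rms
print(f"contaminated sample: NLLR = {observed:.2f}, {n_sigma:.1f} sigma from the null case")
```

Because every event sees the same periodic kernel, no special treatment is needed near $c' = 0$ or $c' = 1$, and the null distribution obtained this way can be used to express an observed $\mathcal{NLLR}$ as a number of standard deviations.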



6. End of Bayesianism? 

By Bayesianism, we mean the practice of "guessing" a prior distribution and introducing it into the
calculations. In what follows we will show that what is conventionally thought of as a Bayesian prior
distribution is in reality a number that can be calculated from the data. We are able to do this since we use
two pdf's, one for theory and one for data. In what follows, we will interpret the probability distribution
of the parameter $s$ in a strictly frequentist sense. The pdf of $s$ is the distribution of the best estimator of the
true value $s_T$ of $s$ from an ensemble of an infinite number of identical experiments with the same statistical
power $n$.

6.1. Calculation of fitted errors 

After the fitting is done and the goodness of fit is evaluated, one needs to work out the errors on the
fitted quantities. One needs to calculate the posterior density $P(s|c_n)$, which carries information not only
about the maximum likelihood point $s^*$ from a single experiment, but also about how such a measurement is
likely to fluctuate if we repeat the experiment. The joint probability density $P(s, c_n)$ of observing the
parameter $s$ and the data $c_n$ is given by

$$P^{data}(s, c_n) = P(s|c_n)\,P^{data}(c_n) \tag{17}$$



where we use the superscript $data$ to distinguish the joint probability $P^{data}(s, c_n)$ as having come from
using the data pdf. If we now integrate the above equation over all possible datasets $c_n$, we get the
expression for the pdf of $s$,

$$\mathcal{P}_n(s) = \int P^{data}(s, c_n)\,dc_n = \int P(s|c_n)\,P^{data}(c_n)\,dc_n \tag{18}$$

where we have used the symbol $\mathcal{P}$ to distinguish the fact that it is the true pdf of $s$ obtained from an
infinite ensemble. We use the subscript $n$ in $\mathcal{P}_n(s)$ to denote that the pdf is obtained from an ensemble of
experiments with $n$ events each. Later on we will show that $\mathcal{P}_n(s)$ is indeed dependent on $n$.
Equation 18 states that in order to obtain the pdf of the parameter $s$, one needs to add together the conditional
probabilities $P(s|c_n)$ over an ensemble of experiments, each such distribution weighted by the "data likelihood"
$P^{data}(c_n)$. At this stage of the discussion, the functions $P(s|c_n)$ are unknown functions. We have,
however, worked out $\mathcal{L}_R(s)$ as a function of $s$ and have evaluated the maximum likelihood value $s^*$ of $s$.
We can choose an arbitrary value of $s$ and evaluate the goodness of fit at that value using the likelihood
ratio. When we choose an arbitrary value of $s$, we are in fact hypothesizing that the true value $s_T$ is at this
value of $s$. $\mathcal{L}_R(s)$ then gives us a way of evaluating the relative goodness of fit of the hypothesis as we
change $s$. Let us now take an arbitrary value of $s$ and hypothesize that it is the true value. Then the joint
probability of observing $c_n$ and $s_T$ being at this value of $s$ is given from the data end by equation 17.

Similarly, from the theoretical end, one can calculate the joint probability of observing the dataset $c_n$
with the true value being at $s$. The true value $s_T$ is taken to be the maximum likelihood point of the pdf
$\mathcal{P}_n(s)$. It may coincide with the mean value of the pdf $\mathcal{P}_n(s)$. These statements are assertions of the
unbiased nature of the data from the experiment. At this point, there is no information available on where
the true value $s_T$ lies. One can make the hypothesis that a particular value of $s$ is the true value, and the
probability of obtaining a best estimator $s^*$ from experiments of the type being performed in the interval
$s_T$ to $s_T + ds_T$ is $\mathcal{P}_n(s_T)\,ds_T$. The actual value of this number is a function of the experimental
resolution and the statistics $n$ of the experiment. The joint probability $P^{theory}(s, c_n)$ from the theoretical
end is given by the product of the probability density of the pdf of $s$ at the true value of $s$, namely
$\mathcal{P}_n(s_T)$, and the theoretical likelihood $P(c_n|s)$ evaluated at the true value, which by our hypothesis is $s$.



$$P^{theory}(s, c_n) = P^{theory}(c_n|s)\,\mathcal{P}_n(s_T) \tag{19}$$



The joint probability $P(s, c_n)$ is a joint distribution of the theoretical parameter $s$ and data $c_n$. The two
ways of evaluating this (from the theoretical end and the data end) must yield the same result, for
consistency. This is equivalent to equating $P^{data}(s, c_n)$ and $P^{theory}(s, c_n)$. This gives the equation

$$P(s|c_n)\,P^{data}(c_n) = P^{theory}(c_n|s)\,\mathcal{P}_n(s_T) \tag{20}$$

which is a form of Bayes' theorem, but with two pdf's (theory and data). Let us note that the above
equation can be immediately re-written as a likelihood ratio

$$\frac{P(s|c_n)}{\mathcal{P}_n(s_T)} = \frac{P^{theory}(c_n|s)}{P^{data}(c_n)} \tag{21}$$



which is what is used to obtain the goodness of fit. In order to get the fitted errors, we need to evaluate
$P(s|c_n)$, which necessitates a better understanding of what $\mathcal{P}_n(s_T)$ is in equation 20. Rearranging
equation 20, one gets

$$P(s|c_n) = \mathcal{L}_R(s)\,\mathcal{P}_n(s_T) = \frac{P^{theory}(c_n|s)}{P^{data}(c_n)}\,\mathcal{P}_n(s_T) \tag{22}$$



6.1.1. To show that $\mathcal{P}_n(s_T)$ depends on $n$

In practice, in both the binned and unbinned cases, one only has an approximation to $P^{data}(c_n)$. As
$n \to \infty$, in the absence of experimental bias, one expects to determine the parameter set $s$ to infinite
accuracy, and $P(s|c_n) \to \delta(s - s_T)$, where $s_T$ is the true value of $s$. However, for the null hypothesis,
as $n \to \infty$, the statistical error introduced by our use of the PDE in the unbinned case or by binning in the
binned case becomes negligible, with the result that the theory pdf describes the data for all $c$ at the true
value $s_T$, i.e.

$$\frac{P^{theory}(c|s_T)}{P^{data}(c)} \to 1 \quad \text{as } n \to \infty \tag{23}$$
When one evaluates the likelihood ratio $\mathcal{L}_R$ over $n$ events, with $n \to \infty$, the likelihood ratio does not
necessarily remain unity. This is due to fluctuations in the data, which grow as $\sqrt{n}$. For the binned
likelihood case with $n_b$ bins, one can show that as $n \to \infty$,

$$\mathcal{L}_R \to e^{-n_b/2} \tag{24}$$



This is just an example of the likelihood ratio theorem. If one uses a binned $\chi^2$ fit, which can also be
thought of as maximizing a likelihood ratio, one gets the same limit as when using binned likelihood fits.
The point is that $\mathcal{L}_R$ is finite as $n \to \infty$. In the unbinned case, we currently have no analytic theory
available. However, one can argue that the binned case with the number of bins $n_b \to \infty$ and $n_b \ll n$
should approach the unbinned limit. In this case, the unbinned $\mathcal{L}_R$ also is finite for infinite statistics.
Since $P(s|c_n)$ tends to a delta function while $\mathcal{L}_R$ remains finite, equation 22 implies that
$\mathcal{P}_n(s_T) \to \infty$ as $n \to \infty$, i.e. $\mathcal{P}_n(s_T)$ depends on $n$. This puts an end to the notion of a
monolithic Bayesian prior interpretation for $\mathcal{P}_n(s)$.

6.1.2. To show that $\mathcal{P}_n(s_T)$ is constant with respect to $s$

When one varies the likelihood ratio in equation 22 as a function of $s$, for each value of $s$, one is
making a hypothesis that $s = s_T$. As one changes $s$, a new hypothesis is being tested that is mutually
exclusive from the previous one, since the true value can only be at one location. So as one changes $s$, one is
free to move the distribution $\mathcal{P}_n(s)$ so that $s_T$ is at the value of $s$ being tested. This implies that
$\mathcal{P}_n(s_T)$ does not change as one changes $s$ and is a constant with respect to $s$, which we can now write
as $\alpha_n$. Figure 6 illustrates these points graphically. Thus $\mathcal{P}_n(s_T)$ in our equations is a number, not a
function. The distribution $\mathcal{P}_n(s)$ should not be thought of as a "prior" but as an "unknown concomitant",
which depends on the statistics and the measurement capabilities of the apparatus. For a given apparatus,
there are a denumerable infinity of such distributions, one for each $n$. These distributions become narrower
as $n$ increases, and $\mathcal{P}_n(s_T) \to \infty$ as $n \to \infty$.

6.2. New form of equations 

Equation 22 can now be re-written

$$P(s|c_n) = \frac{P(c_n|s)\,\alpha_n}{P^{data}(c_n)} \tag{25}$$

Since $P(s|c_n)$ must normalize to unity, one gets for $\alpha_n$

$$\alpha_n = \frac{P^{data}(c_n)}{\int P(c_n|s)\,ds} = \frac{1}{\int \mathcal{L}_R(s)\,ds} \tag{26}$$



We have thus determined $\alpha_n$, the value of the "unknown concomitant" at the true value $s_T$, using our
data set $c_n$. This is our measurement of $\alpha_n$, and different datasets will give different values of $\alpha_n$; in
other words, $\alpha_n$ will have a sampling distribution with an expected value and standard deviation. As
$n \to \infty$, the likelihood ratio $\mathcal{L}_R$ will tend to a finite value at the true value and zero for all other
values, and $\alpha_n \to \infty$ as a result.

Figure 6: Comparison of the usage of Bayesian priors with the new method. In the upper figure, illustrating the
Bayesian method, an unknown distribution is guessed at by the user based on "degrees of belief" and the value of
the Bayesian prior changes as the variable $s$ changes. In the lower figure, an "unknown concomitant" distribution
is used whose shape depends on the statistics. In the case of no bias, this distribution peaks at the true value
of $s$. As we change $s$, we change our hypothesis as to where the true value of $s$ lies, and the distribution shifts
with $s$ as explained in the text. The value of the distribution at the true value is thus independent of $s$.

Note that it is only possible to write down an expression for $\alpha_n$ dimensionally when a likelihood ratio
$\mathcal{L}_R$ is available. This leads to

$$P(s|c_n) = \frac{\mathcal{L}_R}{\int \mathcal{L}_R\,ds} = \frac{P(c_n|s)}{\int P(c_n|s)\,ds} \tag{27}$$



The last equality in equation 27 is the same expression that "frequentists" use for calculating their errors
after fitting, namely the likelihood curve normalized to unity gives the parameter errors. If the likelihood
curve is Gaussian shaped, then this justifies a change of negative log-likelihood of $\frac{1}{2}$ from the optimum
point to get the $1\sigma$ errors. Even if it is not Gaussian, as we show in section 8, we may use the expression for
$P(s|c_n)$ as a pdf of the parameter $s$ to evaluate the errors.
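
Equation 27 and the $\frac{1}{2}$ rule can be tried out on the exponential model of equation (11). The sketch below is an illustration under stated assumptions: 100 events are generated with $s = 1$, the integral over $s$ is done by the trapezoidal rule on a finite grid chosen wide enough to hold essentially all of the likelihood, and all names are hypothetical. Since $P^{data}(c_n)$ cancels in equation 27, only the theory likelihood is needed here.

```python
import math
import random

def log_likelihood(s, n, total):
    """log P(c_n|s) for the exponential model: -n log s - sum(c_i)/s."""
    return -n * math.log(s) - total / s

random.seed(4)
events = [random.expovariate(1.0) for _ in range(100)]
n, total = len(events), sum(events)
s_star = total / n                      # maximum likelihood point for this model

# Grid in s wide enough to contain essentially all of the likelihood
steps = 4000
lo, hi = 0.4 * s_star, 2.5 * s_star
ds = (hi - lo) / steps
grid = [lo + k * ds for k in range(steps + 1)]

log_max = log_likelihood(s_star, n, total)
# Rescale by the maximum before exponentiating to avoid underflow; the factor
# cancels in equation (27).
like = [math.exp(log_likelihood(s, n, total) - log_max) for s in grid]
norm = sum(0.5 * (like[k] + like[k + 1]) * ds for k in range(steps))
posterior = [v / norm for v in like]

# Interval over which -log L rises by at most 1/2 from its minimum
inside = [s for s, v in zip(grid, like) if v >= math.exp(-0.5)]
s_low, s_high = inside[0], inside[-1]

# Probability content of that interval under the normalized likelihood curve
content = sum(0.5 * (posterior[k] + posterior[k + 1]) * ds
              for k in range(steps) if s_low <= grid[k] and grid[k + 1] <= s_high)

print("s* =", round(s_star, 4))
print("interval from the 1/2 rule:", round(s_low, 4), "to", round(s_high, 4))
print("posterior content of that interval:", round(content, 3))
```

For a likelihood curve that is close to Gaussian, the content of the interval should come out near 68%; for strongly non-Gaussian curves the normalized curve itself can be integrated to obtain intervals of any desired content.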

The normalization condition

$$P(c_n) = \int P^{theory}(s, c_n)\,ds = \int P(c_n|s)\,\mathcal{P}_n(s_T)\,ds \tag{28}$$

is obeyed by our solution, since

$$\int P(c_n|s)\,\mathcal{P}_n(s_T)\,ds = \int \alpha_n P(c_n|s)\,ds = P^{data}(c_n) \tag{29}$$

The expression $\int \alpha_n P(c_n|s)\,ds$ in the above equation may be thought of as being due to an "unknown
concomitant" whose peak value is distributed uniformly in $s$ space. The likelihoods of the theoretical
prediction $P(c_n|s)$ contribute with equal probability, each with a weight $\alpha_n$, to sum up to form the data
likelihood $P^{data}(c_n)$; i.e. the data, due to its statistical inaccuracy, will entertain a range of theoretical
parameters. However, this equation does not give us any further information, since it is obeyed identically.
Fitting for the maximum likelihood value $s^*$ of $s$ is attained by maximizing the likelihood ratio
$\mathcal{L}_R = \frac{P(c_n|s)}{P^{data}(c_n)}$. The goodness of fit is obtained using the value of $\mathcal{L}_R$
at the maximum likelihood point. The best theoretical prediction is $P(c|s^*)$, and this prediction is used
to compare to the data pdf $P^{data}(c)$. Note that the maximum likelihood value $s^*$ is also the same point at
which the posterior density $P(s|c)$ peaks. This is true only in our method. When an arbitrary Bayesian prior
is used, the maximum likelihood value is in general not the same point at which the posterior density will
peak. Note also that the normalization equation $\int \mathcal{P}_n(s)\,ds = 1$ is still valid. The integral

$$\int \alpha_n\,ds \neq 1 \tag{30}$$



since $\alpha_n$ is our measurement of the value of $\mathcal{P}_n(s)$ at the true value. It is a measure of the statistical
accuracy of the experiment. The larger the value of $\alpha_n$, the narrower the distribution $\mathcal{P}_n(s)$ and the
more accurate the experiment.



7. Combining Results of Experiments 

Each experiment should publish a likelihood curve for its fit as well as a number for the data likelihood
$P^{data}(c_n)$. Combining the results of two experiments with $m$ and $n$ events each involves multiplying
the likelihood ratios,

$$\mathcal{L}_{R,\,m+n}(s) = \mathcal{L}_{R,\,m}(s) \times \mathcal{L}_{R,\,n}(s) = \frac{P(c_m|s)}{P^{data}(c_m)} \times \frac{P(c_n|s)}{P^{data}(c_n)} \tag{31}$$

Posterior densities and goodness of fit can be deduced from the combined likelihood ratio.
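
In logarithms, equation 31 amounts to adding the two published curves. The following sketch shows the bookkeeping for the exponential model of equation (11); the event lists, the grid in $s$ and the two values of $\log P^{data}$ are hypothetical placeholders standing in for what each experiment would publish.

```python
import math

def log_like_curve(events, grid):
    """log P(c_n|s) on a grid of s for the exponential model of equation (11)."""
    n, total = len(events), sum(events)
    return [-n * math.log(s) - total / s for s in grid]

def combine(log_curve_a, log_pdata_a, log_curve_b, log_pdata_b):
    """Equation (31) in logs: the log likelihood ratios of the two experiments add."""
    return [(a - log_pdata_a) + (b - log_pdata_b) for a, b in zip(log_curve_a, log_curve_b)]

grid = [0.5 + 0.001 * k for k in range(1501)]     # s from 0.5 to 2.0

# Hypothetical published inputs from two experiments
events_m = [0.3, 1.7, 0.9, 2.4, 0.6, 1.1]         # m = 6 events
events_n = [0.8, 0.2, 1.9, 1.3]                   # n = 4 events
log_pdata_m, log_pdata_n = -7.5, -5.0             # placeholder values of log P^data

combined = combine(log_like_curve(events_m, grid), log_pdata_m,
                   log_like_curve(events_n, grid), log_pdata_n)
best = grid[max(range(len(grid)), key=combined.__getitem__)]
pooled_mean = (sum(events_m) + sum(events_n)) / (len(events_m) + len(events_n))

print("combined maximum likelihood point:", round(best, 3))
print("mean of the pooled events        :", round(pooled_mean, 3))
```

The data likelihoods shift the combined curve by a constant and so do not move its maximum, but they are needed to turn the value at the maximum into a goodness of fit.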



8. Interpreting the results of one experiment

After performing a single experiment with $n$ events, we can now calculate $P(s|c_n)$ using equation 27.
Equation 18 gives the prescription for arriving at $\mathcal{P}_n(s)$, given an ensemble of such experiments, the
contribution from each experiment being weighted by the "data likelihood" $P^{data}(c_n)$ for that
experiment. The "data likelihoods" integrate to unity, i.e. $\int P^{data}(c_n)\,dc_n = 1$. In the case of only a
single experiment, with the observed $c_n$ being denoted by $c_n^{obs}$,

$$P^{data}(c_n) = \delta(c_n - c_n^{obs}) \tag{32}$$

Equation 18, for a single experiment, then reduces to

$$\mathcal{P}_n(s) = \int P(s|c_n)\,P^{data}(c_n)\,dc_n = P(s|c_n^{obs}) \tag{33}$$



i.e. given a single experiment, the best estimator for $\mathcal{P}_n(s)$, the pdf of $s$, is $P(s|c_n^{obs})$, and thus
the best estimator for the true value $s_T$ is $s^*_{obs}$ deduced from the experiment. We can thus use
$P(s|c_n^{obs})$ as though it is the pdf of $s$ and deduce limits and errors from it. The proviso is of course that
these limits and errors, as well as $s^*_{obs}$, come from a single experiment of finite statistics and as such are
subject to statistical fluctuations.



9. Comparison with the Bayesian approach

In the Bayesian approach, an unknown Bayesian prior $P(s)$ is assumed for the distribution of the
parameter $s$ in the absence of any data. The shape of the prior is guessed at, based on subjective criteria or
using other objective pieces of information. However, such a shape is not invariant under transformation of
variables. For example, if we assume that the prior $P(s)$ is flat in $s$, then if we analyze the problem in $s^2$,
we cannot assume it is flat in $s^2$. This feature of the Bayesian approach has caused controversy. Also, the
notion of a pdf of the data does not exist and $P(c)$ is taken to be a normalization constant. As such, no
goodness of fit criteria exist. In the method outlined here, we have used Bayes' theorem to calculate
posterior densities of the fitted parameters while being able to compute the goodness of fit. The formalism
developed here shows that what is conventionally thought of as a Bayesian prior distribution is in fact a
normalization constant, and what Bayesians think of as a normalization constant is in fact the pdf of the data.
Table II outlines the major differences between the Bayesian approach and the new one.
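
The dependence of a Bayesian result on the variable in which the prior is taken to be flat can be seen in a small numerical example. The sketch below uses the exponential model of equation (11) with a hypothetical data summary of 20 events whose $c_i$ sum to 20, and compares the peak of the posterior density in $s$ for a prior flat in $s$ with that for a prior flat in $s^2$; the grid and all names are illustrative.

```python
import math

def log_likelihood(s, n, total):
    """log P(c_n|s) for the exponential model of equation (11)."""
    return -n * math.log(s) - total / s

def peak(log_posterior, grid):
    """Grid point at which the given log posterior density is largest."""
    return max(grid, key=log_posterior)

n, total = 20, 20.0                               # hypothetical data summary
grid = [0.5 + 0.0005 * k for k in range(3001)]    # s from 0.5 to 2.0

# Prior flat in s: the posterior density in s is proportional to the likelihood.
peak_flat_in_s = peak(lambda s: log_likelihood(s, n, total), grid)

# Prior flat in u = s^2: as a density in s it carries the Jacobian du/ds = 2s.
peak_flat_in_s2 = peak(lambda s: log_likelihood(s, n, total) + math.log(2.0 * s), grid)

print("maximum likelihood point         :", total / n)
print("posterior peak, prior flat in s  :", round(peak_flat_in_s, 4))
print("posterior peak, prior flat in s^2:", round(peak_flat_in_s2, 4))
```

With the prior flat in $s$ the posterior peaks at the maximum likelihood point, while the prior flat in $s^2$ moves the peak to $\sum c_i/(n-1)$, a shift that becomes negligible only for large $n$.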
Table II The key points of difference between the Bayesian method and the new method.

| Item | Bayesian Method | New Method |
|---|---|---|
| Goodness of fit | Absent | Now available in both binned and unbinned fits |
| Data | Used in evaluating theory pdf at data points | Used in evaluating theory pdf at data points as well as evaluating data pdf at data points |
| Prior | Is a distribution that is guessed based on "degrees of belief". Independent of data, monolithic | No prior needed. One calculates a constant from data, $\alpha_n = \frac{P^{data}(c_n)}{\int P(c_n \mid s)\,ds} \to \infty$ as $n \to \infty$ |
| Posterior density $P(s \mid c_n)$ | Depends on prior: $\frac{P(c_n \mid s)P(s)}{\int P(c_n \mid s)P(s)\,ds}$ | Independent of prior, same as frequentists use: $\frac{P(c_n \mid s)}{\int P(c_n \mid s)\,ds}$ |




10. Further work to be done 



This scheme involves the usage of two pdf's, namely data and theory. In the process of computing the
fitted errors, we have demonstrated that the quantity in the joint probability equations that has been
interpreted as the "Bayesian prior" is in reality a number and not a distribution. This number is the value of
the pdf of the parameter, which we call the "unknown concomitant", at the true value of the parameter. This
number is calculated from a combination of data and theory and is seen to be an irrelevant parameter. If
this viewpoint is accepted, the controversial practice of guessing distributions for the "Bayesian prior" can
now be abandoned, as can the terms "Bayesian" and "frequentist". We show how to use the posterior
density to rigorously calculate fitted errors.
Equation 18 can be used to show that the expectation value $E(s)$ of the parameter $s$ is given by

$$E(s) = \int s\,\mathcal{P}_n(s)\,ds = \int dc_n\,P(c_n) \int s\,P(s|c_n)\,ds \tag{34}$$

$$= \int \bar{s}(c_n)\,P(c_n)\,dc_n \tag{35}$$